2025-01-03
Importance of human judgment and contextual knowledge
Higher-quality data than automated methods such as dictionaries
Highly time-consuming, labor-intensive, and costly
Often relies on a small sample of texts, which can be biased
| Machine Learning Lingo | Statistics Lingo |
|---|---|
| Feature | Independent variable |
| Label | Dependent variable |
| Labeled dataset | Dataset with both independent and dependent variables |
| To train a model | To estimate |
| Classifier (classification) | Model to predict nominal outcomes |
| To annotate | To (manually) code (content analysis) |
Licht et al. (2024): A supervised learning workflow
# Getting a clean corpus of labeled texts
Use the test set to evaluate the performance of the model
The model is used to predict the categories of the texts in the test set
The predictions are compared to the true categories using different metrics
Accuracy: proportion of correctly classified texts (of limited use for imbalanced datasets, where always predicting the majority class already scores high)
\[ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \]
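
The limitation of accuracy on imbalanced data can be made concrete with a small sketch. The class names (`"populist"`, `"other"`) and the 95/5 split are hypothetical, chosen only for illustration; the "model" is a deliberately useless baseline that always predicts the majority class.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical test set: 95 texts labeled "other", 5 labeled "populist".
y_true = ["other"] * 95 + ["populist"] * 5

# A baseline that ignores the text and always predicts the majority class.
y_pred = ["other"] * 100

acc = accuracy(y_true, y_pred)

# Number of "populist" texts the baseline actually identified.
found = sum(t == p == "populist" for t, p in zip(y_true, y_pred))

print(f"accuracy = {acc:.2f}, populist texts found = {found}")
```

The baseline reaches an accuracy of 0.95 without finding a single text of the minority class, which is why accuracy alone is not enough here.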
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
\[ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \]
\[ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \]
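
As a minimal sketch, the four cells of the confusion matrix and the two formulas above can be computed directly from true and predicted labels. The helper `confusion_counts` and the toy labels are assumptions made for this example, not part of any library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP, TN for one class treated as 'positive'."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1  # actual positive, predicted positive
        elif t == positive:
            fn += 1  # actual positive, predicted negative
        elif p == positive:
            fp += 1  # actual negative, predicted positive
        else:
            tn += 1  # actual negative, predicted negative
    return tp, fn, fp, tn


# Toy test set: 4 positive texts and 6 negative texts (illustrative values).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive=1)

recall = tp / (tp + fn)      # share of actual positives that were found
precision = tp / (tp + fp)   # share of predicted positives that are correct

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"recall={recall:.2f} precision={precision:.2f}")
```

This sketch does not guard against division by zero; a model that never predicts the positive class would need that case handled explicitly.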
\[ \text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

## Recall and precision trade-offs {.smaller}
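
One way to see the trade-off between recall and precision is to vary the decision threshold of a classifier that outputs a score per text. The sketch below assumes such scores exist; the scores, labels, and the helper `precision_recall_f1` are invented for illustration.

```python
def precision_recall_f1(labels, scores, threshold):
    """Classify as positive when score >= threshold, then compute the metrics."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical model scores (higher = more likely positive) and true labels.
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

results = {}
for threshold in (0.8, 0.5, 0.25):
    results[threshold] = precision_recall_f1(labels, scores, threshold)
    p, r, f1 = results[threshold]
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy example, a strict threshold yields high precision but misses half of the positive texts, while a lenient threshold finds all of them at the cost of more false positives. Which balance is preferable depends on the research question.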
| Problem | Solution |
|---|---|
| Imbalanced classes | Undersampling & oversampling |
| Not enough training data | More annotation |
| Poor quality of the training data | Better annotation |
| Poor quality of the text features | Better preprocessing |
| Limited text representation | Use more complex models |
| Concept too complex | Accept okay-ish performance |
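
The first row of the table, undersampling and oversampling, can be sketched with the standard library alone. This is a simple random variant, assuming the resampling is applied to the training set only and never to the test set; the function names and toy data are hypothetical.

```python
import random
from collections import Counter


def _group_by_label(texts, labels):
    """Collect the texts belonging to each label."""
    groups = {}
    for text, label in zip(texts, labels):
        groups.setdefault(label, []).append(text)
    return groups


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority texts until all classes are equal in size."""
    rng = random.Random(seed)
    groups = _group_by_label(texts, labels)
    target = max(len(items) for items in groups.values())
    out_texts, out_labels = [], []
    for label, items in groups.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_texts.extend(items + extra)
        out_labels.extend([label] * target)
    return out_texts, out_labels


def random_undersample(texts, labels, seed=0):
    """Drop randomly chosen majority texts until all classes are equal in size."""
    rng = random.Random(seed)
    groups = _group_by_label(texts, labels)
    target = min(len(items) for items in groups.values())
    out_texts, out_labels = [], []
    for label, items in groups.items():
        out_texts.extend(rng.sample(items, target))
        out_labels.extend([label] * target)
    return out_texts, out_labels


# Toy training set: 8 "neg" texts and 2 "pos" texts (illustrative only).
texts = [f"text {i}" for i in range(10)]
labels = ["neg"] * 8 + ["pos"] * 2

over_texts, over_labels = random_oversample(texts, labels)
under_texts, under_labels = random_undersample(texts, labels)

print(Counter(over_labels))
print(Counter(under_labels))
```

Oversampling keeps all annotated texts but repeats minority examples, which can encourage overfitting to them; undersampling avoids repetition but discards annotated data.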
## Challenges in supervised text classification
Supervised text classification